Release notes
Curated changelog for the scrapingpros Python SDK. Each release lists what changed from the user's perspective — bugs you'd actually hit, features you can use, things that may need attention.
Quick links
| 📦 Install | pip install --upgrade scrapingpros |
| 🐍 PyPI | pypi.org/project/scrapingpros |
| 📚 Docs | docs.scrapingpros.com/docs/category/python-sdk |
| 🔧 API status | api.scrapingpros.com |
| 🆓 Demo token | demo_6x595maoA6GdOdVb (5,000 credits/month, no signup) |
0.7.8 — 2026-05-31
Robustness patch on the v0.7.7 inline-result path. A single malformed inline body could surface as a ValidationError against the whole jobs listing page — losing every job on that page, not just the bad row. v0.7.8 isolates the failure: the offending row's result becomes None and the SDK's existing per-job GET /jobs/{id}/result fallback handles it, exactly as if the server had returned result=None to begin with.
Why
Production reports of pydantic.ValidationError raised from iter_run_jobs(include_results=True) (and transitively from Batch.iter_results()) during active draining. Jobs in transient states can arrive with result shapes that don't strictly match ScrapeResponse — a missing required sub-field, a sub-field with the wrong type, an unexpected payload. Before this patch, that single row aborted the whole page parse.
Changed
- JobExecutionPublic.result parses tolerantly. A new @field_validator(mode="before") catches any exception raised while deserializing the nested ScrapeResponse and returns None. The well-formed rows on the same page parse normally. The malformed row's body is fetched via GET /jobs/{id}/result on demand by _build_result, same as the legacy / oversize / blob-miss fallback paths.

  Pre-existing behaviour for well-formed inline bodies is unchanged: those still parse to ScrapeResponse and skip the per-job fetch.
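The isolation behaviour can be illustrated without the SDK. The sketch below is a stdlib-only stand-in, not the SDK's implementation: `Row`, `parse_result`, and `parse_page` are hypothetical names, and the integer `statusCode` check merely stands in for full ScrapeResponse validation.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Row:
    job_id: str
    result: Optional[dict]  # None means "fetch via the per-job fallback"


def parse_result(raw: Any) -> Optional[dict]:
    """Hypothetical tolerant parser: any failure becomes None."""
    try:
        if not isinstance(raw, dict):
            raise TypeError("result must be an object")
        # Stand-in for strict validation: require an integer status code.
        if not isinstance(raw["statusCode"], int):
            raise TypeError("statusCode must be an int")
        return raw
    except Exception:
        return None


def parse_page(page: list) -> list:
    # Job-level fields stay strict: a missing "id" still raises KeyError.
    return [Row(job_id=item["id"], result=parse_result(item.get("result")))
            for item in page]


page = [
    {"id": "a", "result": {"statusCode": 200, "html": "<p>ok</p>"}},
    {"id": "b", "result": {"html": "missing statusCode"}},
    {"id": "c", "result": "unexpected payload"},
]
rows = parse_page(page)
needs_fallback = [r.job_id for r in rows if r.result is None]
print(needs_fallback)
```

Only the two malformed rows end up needing the per-job fetch; the well-formed row keeps its inline body.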
Notes
- Strict superset of v0.7.7. No API changes.
- If you reverted to include_results=False after hitting the v0.7.7 issue, you can re-enable it on v0.7.8.
- The same tolerance does not apply to other listing fields. Anything else that fails to validate (e.g. an unexpected job-level metadata shape) still raises, because that would be real schema drift the caller should see.
0.7.7 — 2026-05-31
The polling-efficiency story closes. Two changes that compound: the SDK now consumes the inline result body when the listing brings it, and tightens the counter-based polling so the brief window between "run reached terminal state" and "listing finished seeding" can't drop the last few jobs. Combined with the v0.7.3 (adaptive poll_interval) and v0.7.4 (counter short-circuit) work already on PyPI, a high-volume iter_results() consumer should see its polling traffic drop by roughly an order of magnitude — without any caller-side migration.
Added
- include_results= opt-in on the jobs listing endpoint: client.iter_run_jobs(...) and client.get_run_jobs(...) (sync + async) accept a new include_results: bool = False kwarg. When True, each completed job in the page carries its full body in the new JobExecutionPublic.result field, eliminating the GET /jobs/{id}/result per-job round-trip that previously dominated polling traffic. The server caps the page size to 100 when this flag is on (vs 1000 for metadata-only); the SDK mirrors that cap. Bodies above the inline cap (~256 KB), non-completed jobs, and listings from older servers all surface as result=None; the SDK falls back per-job to the existing /result endpoint for those.

      for job in client.iter_run_jobs(cid, rid,
                                      status_filter="completed",
                                      include_results=True):
          # Use the inline body when present, else fetch it per job.
          result = job.result or client.get_job_result(
              job.collection_id, job.run_public_id, job.job_public_id
          )
          process(result, job.custom_id)

- JobExecutionPublic.result: ScrapeResponse | None is the inline body that include_results=True populates. It is None on the metadata-only path and for the fallback cases above.
- RunPublic.all_jobs_persisted: bool | None is exposed in the model for callers that want to read it directly (it signals when the run's listing is fully drainable). The SDK's polling does not key off this field; instead, the iteration completeness guard described below uses the run counters plus the set of jobs the SDK has already yielded.
Changed
- Batch.iter_results (and AsyncBatch.iter_results) automatically use the inline-body path. Internally the polling tick now sends include_results=true to the listing endpoint and consumes job.result directly when present, falling back to the per-job /result only for the ~5% of jobs the server marks result=None. Calling code is unchanged: no migration, no opt-in, no new kwargs. Existing iter_results() loops just see fewer outbound HTTP calls.
- Iteration completeness guard: Batch.iter_results (and async) only exit the polling loop once len(seen_job_ids) >= expected_terminal_count, even if the run's status flipped to completed earlier. There is a brief window where the run reports status=completed while the listing is still seeding the final few jobs; the SDK previously could break out of the loop and drop them. Now it keeps polling until the listing catches up. The expected_terminal_count adapts to the include_failed flag (uses success_requests only when failures are excluded), so callers asking only for successful jobs still exit cleanly.
- Job → result metadata propagation. result.custom_id and result.url are backfilled from the underlying JobExecutionPublic if the server didn't echo them on the body. Same behaviour regardless of whether the body came inline or via the /result fallback. Keeps the existing traceability contract intact.
- status_filter on the low-level iter_run_jobs / get_run_jobs (sync + async) accepts a list or CSV string. Drains multiple terminal states in one paginated stream instead of one call per status. The high-level Batch.iter_results already drained in one stream internally; this is a public-surface fix for the low-level path:

      # Before: 3 round-trips
      for status in ("completed", "failed", "timeout"):
          for job in client.iter_run_jobs(cid, rid, status_filter=status):
              process(job)

      # After (v0.7.7+): 1 paginated stream
      for job in client.iter_run_jobs(
          cid, rid,
          status_filter=["completed", "failed", "timeout"],
      ):
          process(job)
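The iteration completeness guard described in the Changed list above can be sketched as a pure function. This is an illustration under stated assumptions, not the SDK's code: `expected_terminal_count` and `may_exit` are written here from the description, the sum of the three terminal counters is assumed for the include_failed=True case, and only "completed" is treated as a terminal run status.

```python
def expected_terminal_count(counters: dict, include_failed: bool) -> int:
    """How many jobs the listing must eventually yield for this consumer."""
    if include_failed:
        # Assumption: every terminal bucket counts when failures are wanted.
        return (counters["success_requests"]
                + counters["failed_requests"]
                + counters["timeout_requests"])
    return counters["success_requests"]


def may_exit(run_status: str, seen_job_ids: set, counters: dict,
             include_failed: bool) -> bool:
    # Exit only when the run is terminal AND the listing has caught up.
    if run_status != "completed":
        return False
    return len(seen_job_ids) >= expected_terminal_count(counters, include_failed)


counters = {"success_requests": 98, "failed_requests": 1, "timeout_requests": 1}
seen_97 = {f"job-{i}" for i in range(97)}
seen_98 = {f"job-{i}" for i in range(98)}

still_seeding = may_exit("completed", seen_97, counters, include_failed=False)
caught_up = may_exit("completed", seen_98, counters, include_failed=False)
wants_failures = may_exit("completed", seen_98, counters, include_failed=True)
print(still_seeding, caught_up, wants_failures)
```

With 97 of 98 successful jobs seen, the loop keeps polling even though the run already reports completed; a consumer that also wants failures waits for all 100.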
Impact
For a polling pattern that hits the high-level Batch.iter_results(), the per-tick wire cost evolves like this on a typical run with hundreds of completed jobs:
| Tick component | v0.7.4 | v0.7.7 |
|---|---|---|
| GET /runs/{rid} (status) | 1 | 1 |
| GET /jobs?... (listing) | 0–1 (counter short-circuit) | 0–1 (counter + persisted short-circuits) |
| GET /jobs/{id}/result per completed job | N | ~0 (inline) |
The per-job /result calls were the dominant cost on completed-heavy runs. With v0.7.7 they disappear from the wire on the happy path, and the listing-only short-circuits keep the idle-tick cost at a single status call.
Notes / migration
- Strict superset of v0.7.5 / v0.7.6 (this release supersedes the unreleased v0.7.6 candidate; both sets of changes ship together). Every existing iter_results() / iter_run_jobs() / get_run_jobs() call keeps working unchanged.
- Backward compat with older servers: JobExecutionPublic.result defaults to None and RunPublic.all_jobs_persisted is bool | None. Servers that don't return these fields still parse cleanly, and the SDK behaviour falls back to the previous polling pattern automatically (per-job /result, no persistence guard).
- What you have to do: nothing. The inline-body consumption is internal to Batch.iter_results, with no kwarg to set and no migration. If you're on the low-level iter_run_jobs path and want the same win, pass include_results=True explicitly.
0.7.5 — 2026-05-28
Cross-process resume — pass a cursor when reattaching to a batch and the SDK starts strictly after that point instead of re-yielding every job. Closes the last remaining gap that pushed restart-resilient pipelines onto the low-level path, so users who were rolling their own polling loops to get cursor support can now stay on iter_results and inherit the v0.7.3 / v0.7.4 polling optimisations for free.
Added
- Batch.iter_results(since=...) and AsyncBatch.iter_results(since=...): cross-process resume cursor. Accepts a datetime or an ISO 8601 string; jobs with completed_at <= since are skipped. The SDK uses the server-side since_completed_at filter so the resume pages only new jobs from the wire (no client-side dedup needed).
- Batch.last_completed_at / AsyncBatch.last_completed_at: read-only property exposing the high-water mark the SDK tracks during iteration. Read it after each yielded result and persist alongside (collection_id, run_id) for cross-process resume:

      for result in batch.iter_results():
          save(result)
          # Persist the cursor next to the IDs after every result.
          db.update(cid=batch.collection_id, rid=batch.run_id,
                    cursor=batch.last_completed_at)

      # Different process:
      for result in client.iter_results(saved_cid, saved_rid, since=saved_cursor):
          save(result)

- since= on client.iter_results(cid, rid, ...) and AsyncClient.iter_results(cid, rid, ...): the shortcut propagates the kwarg to the underlying Batch.
- since_completed_at= on client.iter_run_jobs(...) and client.get_run_jobs(...) (sync + async): same primitive exposed on the low-level path. For users who keep a custom polling loop, this closes the parity gap with what Batch.iter_results uses internally. Documented as the canonical "rolling your own polling loop" recipe in Collections (low-level).
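The cursor semantics above (skip jobs with completed_at <= since, accept a datetime or an ISO 8601 string) can be sketched client-side. The SDK applies this filter on the server; the stdlib-only version below is only a model of the rule, and `to_datetime`, `resume_after`, and the job dicts are hypothetical.

```python
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional, Union

Cursor = Union[datetime, str, None]


def to_datetime(cursor: Cursor) -> Optional[datetime]:
    """Accept a datetime, an ISO 8601 string, or None."""
    if cursor is None or isinstance(cursor, datetime):
        return cursor
    return datetime.fromisoformat(cursor)


def resume_after(jobs: Iterable[dict], since: Cursor = None) -> Iterator[dict]:
    cutoff = to_datetime(since)
    for job in jobs:
        # Strictly after the cursor: a job at exactly `since` was already seen.
        if cutoff is not None and job["completed_at"] <= cutoff:
            continue
        yield job


def at(second: int) -> datetime:
    return datetime(2026, 5, 28, 10, 0, second, tzinfo=timezone.utc)


jobs = [
    {"id": "j1", "completed_at": at(0)},
    {"id": "j2", "completed_at": at(1)},
    {"id": "j3", "completed_at": at(2)},
]

first_pass = [j["id"] for j in resume_after(jobs)]
high_water_mark = max(j["completed_at"] for j in jobs[:2])  # crashed after j2
resumed = [j["id"] for j in resume_after(jobs, since=high_water_mark.isoformat())]
print(first_pass, resumed)
```

Persisting the high-water mark after each result is what makes the second pass start at j3 instead of replaying j1 and j2.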
Notes / migration
- Strict superset of v0.7.4. Zero breakage: every existing iter_results() / iter_run_jobs() call without since keeps starting from the beginning, exactly as before.
- Why this matters: pipelines that crash and restart, webhook handlers that reattach inside an HTTP request handler, dashboards that page through a long-running batch. All of these used to either re-process duplicates or write their own paginator around iter_run_jobs. Now the high-level path supports the pattern natively, and migrating to it brings the v0.7.3 (adaptive poll_interval) and v0.7.4 (counter short-circuit) optimisations along.
- Concurrent multi-batch draining: the canonical pattern is still asyncio.gather over independent iter_results generators. Each batch keeps its own adaptive cadence and short-circuit state, and the SDK doesn't add a coalesced drain_many because per-batch (cid, rid) are separate server endpoints with no bulk-status equivalent. See Batch API → Draining several batches for the 6-line recipe.
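The multi-batch pattern named in the last note can be sketched with plain asyncio. This is not the recipe from the Batch API page: `fake_iter_results` is a hypothetical stand-in for an async iter_results generator, and the names and delays are illustrative only.

```python
import asyncio


async def fake_iter_results(name: str, count: int, delay: float):
    """Hypothetical stand-in for an async batch.iter_results() generator."""
    for i in range(count):
        await asyncio.sleep(delay)  # each batch ticks at its own cadence
        yield f"{name}-{i}"


async def drain(name: str, count: int, delay: float) -> list:
    # One independent consumer per batch.
    return [result async for result in fake_iter_results(name, count, delay)]


async def main() -> list:
    # gather runs the consumers concurrently and keeps input order.
    return await asyncio.gather(
        drain("a", 3, 0.01),
        drain("b", 2, 0.02),
    )


drained = asyncio.run(main())
print(drained)
```

Each consumer advances independently, so a slow batch does not hold back a fast one, and the gathered list still lines up with the order the batches were passed in.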
0.7.4 — 2026-05-28
Polling-side request reduction: when the run's aggregate terminal counters (success_requests + failed_requests + timeout_requests on RunPublic) haven't moved since the previous tick, iter_results now skips the jobs-page query entirely. The status response alone tells us nothing new happened — querying the jobs page is guaranteed to come back empty, so the round-trip is wasted.
Changed
Batch._collect_new_terminal_jobs / AsyncBatch._collect_new_terminal_jobs short-circuit on stable counters. Each idle polling tick now costs one request (run status) instead of two (run status + jobs page). The since_completed_at filter still guarantees we'd pick up any missed jobs on the next non-skipped tick, so this is a pure latency / request optimisation with no correctness impact.
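A minimal model of the short-circuit, assuming the three counter names quoted above and a first tick that always queries the jobs page. `terminal_total` and `should_query_jobs` are hypothetical helpers written for this sketch, not SDK functions.

```python
from typing import Optional

TERMINAL_COUNTERS = ("success_requests", "failed_requests", "timeout_requests")


def terminal_total(run: dict) -> int:
    return sum(run[name] for name in TERMINAL_COUNTERS)


def should_query_jobs(previous_total: Optional[int], run: dict) -> bool:
    """Skip the jobs page when the terminal counters have not moved."""
    return previous_total is None or terminal_total(run) != previous_total


def snapshot(success: int, failed: int, timeout: int) -> dict:
    return {"success_requests": success,
            "failed_requests": failed,
            "timeout_requests": timeout}


# Five polling ticks; counters move on the third and fifth.
ticks = [snapshot(0, 0, 0), snapshot(0, 0, 0), snapshot(4, 1, 0),
         snapshot(4, 1, 0), snapshot(8, 1, 0)]

requests = 0
jobs_queries = 0
previous = None
for run in ticks:
    requests += 1                      # the status call always happens
    if should_query_jobs(previous, run):
        requests += 1                  # the jobs page, only when needed
        jobs_queries += 1
    previous = terminal_total(run)

print(requests, jobs_queries)
```

Without the short-circuit the same five ticks would cost ten requests; here the two idle ticks cost one request each.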
Impact
Combined with the v0.7.3 adaptive poll_interval:
| Scenario | v0.7.2 polling | v0.7.3 polling | v0.7.4 polling |
|---|---|---|---|
| 5,000-URL batch, 1 h run, jobs trickle in 20 bursts | ~1,440 req/h (5 s × 2) | ~240 req/h (30 s × 2) | ~140 req/h (30 s × 1 + 20 × 1) |
| 5 such batches in parallel | ~7,200 req/h | ~1,200 req/h | ~700 req/h |
The optimisation is invisible to callers — iter_results yields the same results in the same order; only the polling traffic drops.
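The figures in the table follow from simple arithmetic over a one-hour run. The helper below is a hypothetical calculator that reproduces them under the table's own assumptions (two requests per tick before v0.7.4, one per idle tick plus one extra per burst afterwards).

```python
SECONDS_PER_HOUR = 3600


def polling_requests_per_hour(poll_interval_s: int,
                              requests_per_idle_tick: int,
                              bursts: int = 0) -> int:
    ticks = SECONDS_PER_HOUR // poll_interval_s
    # Each burst adds one jobs-page query on top of the idle cost.
    return ticks * requests_per_idle_tick + bursts


v072 = polling_requests_per_hour(5, 2)              # 5 s cadence, status + jobs
v073 = polling_requests_per_hour(30, 2)             # 30 s cadence, status + jobs
v074 = polling_requests_per_hour(30, 1, bursts=20)  # status only + 20 bursts

single = (v072, v073, v074)
parallel_five = tuple(5 * n for n in single)
print(single, parallel_five)
```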
Notes
- Pure additive: no API changes, no behaviour changes for callers that pass an explicit poll_interval.
- Safe even if the server's counters lag by one tick: the next non-skipped tick re-queries from the same since_completed_at high-water mark and catches everything.
0.7.3 — 2026-05-23
Adaptive poll_interval default — the polling cadence for iter_results and run_and_wait is now sized to the batch instead of a fixed 5 s / 2 s. Small batches stay responsive; long batches stop burning the rate budget on status checks. A pipeline running five 5,000-URL batches in parallel — each iterating at the old 5 s default — used to spend ~3,000 requests per hour just polling. With the new default it spends ~500, leaving the rest of the rate budget for actual scraping.
Changed
- Batch.iter_results(poll_interval=...) and AsyncBatch.iter_results(poll_interval=...): default changes from a constant 5.0 to an adaptive value picked from the batch's queued count:

  | Items in queue | Default poll_interval |
  |---|---|
  | < 100 | 5 s (status quo) |
  | 100 – 499 | 10 s |
  | 500 – 1,999 | 15 s |
  | ≥ 2,000 | 30 s |

  Pass an explicit poll_interval=N to override. Same tier table applies to Batch.run_until_complete / AsyncBatch.run_until_complete and to the client.iter_results(cid, rid) shortcut.
- ScrapingPros.run_and_wait(poll_interval=...) and AsyncScrapingPros.run_and_wait(poll_interval=...): default changes from 2.0 to an adaptive value. run_and_wait ticks are cheaper than iter_results ticks (one status GET vs status + jobs page + N parallel result fetches), so the tier values are smaller:

  | Items in queue | Default poll_interval |
  |---|---|
  | < 100 | 2 s (status quo) |
  | 100 – 499 | 5 s |
  | 500 – 1,999 | 10 s |
  | ≥ 2,000 | 20 s |

  Sized off the just-created run's total_requests. The 3,600 s timeout default is unchanged.
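The two tier tables amount to a small lookup. The function below is a local re-statement of the tables for illustration; `tiered_poll_interval` is a hypothetical name, and the SDK's own helper may be implemented differently.

```python
# (exclusive upper bound on items, poll interval in seconds)
TIERS = {
    "jobs":   [(100, 5.0), (500, 10.0), (2000, 15.0)],
    "status": [(100, 2.0), (500, 5.0), (2000, 10.0)],
}
# Interval used once the batch reaches 2,000 items or more.
CEILING = {"jobs": 30.0, "status": 20.0}


def tiered_poll_interval(items: int, kind: str = "jobs") -> float:
    for upper_bound, seconds in TIERS[kind]:
        if items < upper_bound:
            return seconds
    return CEILING[kind]


jobs_cadence = [tiered_poll_interval(n) for n in (99, 100, 499, 500, 1999, 2000)]
status_cadence = [tiered_poll_interval(n, kind="status") for n in (50, 1500, 5000)]
print(jobs_cadence, status_cadence)
```

The boundaries are the interesting part: 100 and 500 already fall into the next tier, and anything from 2,000 items up gets the ceiling value.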
Added
- scrapingpros.adaptive_poll_interval(items, *, kind="jobs"|"status"): public helper exposing the tier lookup. Use it to predict the default cadence the SDK will pick for a given batch size, or in your own monitoring loops:

      from scrapingpros import adaptive_poll_interval

      cadence = adaptive_poll_interval(len(my_urls))  # for iter_results
      status_cadence = adaptive_poll_interval(len(my_urls), kind="status")  # for run_and_wait

- INFO-level log line emitted once per iter_results / run_and_wait when the adaptive default is selected:

      INFO Batch <run_id>: iterating 1500 item(s) with poll_interval=15s (auto). Pass poll_interval= to override.

  Visible if you've set the scrapingpros logger to INFO; silent at the default WARNING level. Helps surface the cadence in production logs without forcing it on every caller.
Notes / migration
- No breaking changes. Any code that passed poll_interval= explicitly is unaffected. Code that relied on the implicit 5 s / 2 s now sees the adaptive value: same correctness, different cadence.
- Same surface, smaller cost on multi-batch pipelines. If you orchestrate several long-running batches in parallel, you'll see the difference most. A single small batch behaves identically to v0.7.2.
- When to override: pass poll_interval=N if you have a latency-sensitive UI (override to a lower value) or want even gentler polling than the default (override higher). The tiers are sized to be safe, not minimum.
0.7.2 — 2026-05-23
Surfaces the per-item validation buckets the server now returns when creating a collection. Before this release, an item rejected for a field-level reason (e.g. custom_id over 255 chars) failed the whole submit with HTTP 422; the API now buckets those rejections into invalid_items and creates the collection with whichever items passed. v0.7.1 silently discarded those buckets: a caller submitting 1,000 items with 3 long custom_ids would see a Batch claiming 1,000 items while the server enqueued 997. v0.7.2 surfaces all of it.
Added
- Batch.invalid_items (list of InvalidItem) and AsyncBatch.invalid_items: items the server rejected for Pydantic body-validation or parameter-rule reasons (custom_id too long, screenshot without browser, etc.). Each InvalidItem carries its 0-based index, the url if it could be read, and a list of InvalidItemError (field, error_type, message).

      batch = client.submit_batch("daily", items)
      for it in batch.invalid_items:
          print(f" - [{it.index}] {it.url}")
          for err in it.errors:
              print(f" {err.field}: {err.error_type} — {err.message}")

- Batch.duplicate_urls / AsyncBatch.duplicate_urls: the explicit list of URLs the server skipped as duplicates of an earlier item in the same submit (one entry per skipped occurrence). The legacy duplicates_skipped count is preserved as Batch.duplicates_skipped.
- Batch.blocked_urls / AsyncBatch.blocked_urls: same BlockedURL shape that submit_batch_lenient returned in v0.5.3, now accessible directly on every Batch so non-lenient callers can inspect them too.
- InvalidItem and InvalidItemError models exported from the top-level package.
- NewCollectionResponse.invalid_items and NewCollectionResponse.duplicate_urls: already populated by the server; the SDK model was discarding them before this release.
- Batch.summary() / AsyncBatch.summary() returning a frozen BatchSummary dataclass: the single call to get a complete end-of-run report. Counts both the items the server ran (queued / succeeded / failed) and the items it rejected at submit time (blocked / invalid / duplicates), so a caller no longer has to assemble that picture from four different attributes. str(summary) produces a multi-line ASCII block ready for logs:

      Batch summary (status: completed)
      submitted : 1752
      queued : 1749
      succeeded: 1701
      failed : 48
      rejected : 3
      blocked : 0
      invalid : 3
      duplicates : 0

  Canonical usage inside an on_complete callback so the report fires automatically when the run terminates:

      def report(b):
          print(b.summary())

      batch = client.submit_batch("daily", items).on_complete(report)
      for r in batch.iter_results():
          handle(r)
      # At loop exit, `report` has fired with the full picture.

  Invariants BatchSummary exposes (and the SDK enforces in tests): submitted == queued + blocked + invalid + duplicates after submit, and succeeded + failed == queued once is_finished is True.
- Batch.submitted_count / AsyncBatch.submitted_count: the original payload length (what the caller handed to submit_batch). None on handles built via client.get_batch() (reattached after a process restart): the server does not echo back the original submit size on GET /v1/async/collections/{id} yet, so the SDK cannot recover it without persisting len(payload) alongside the IDs in your own storage.
- get_batch(submitted_count=...) and client.iter_results(cid, rid, submitted_count=...): pass-through hint for the reattach case. If you persisted len(payload) alongside (cid, rid), hand it back when reattaching and batch.summary() reports the full picture instead of submitted=None. Same kwarg on AsyncClient.get_batch / AsyncClient.iter_results.

      # On submit: persist alongside the IDs.
      db.batches.insert(cid=batch.collection_id, rid=batch.run_id,
                        submitted=len(items))

      # On reattach (webhook, restart, dashboard):
      row = db.batches.find(cid=cid)
      for r in client.iter_results(cid, rid, submitted_count=row.submitted):
          handle(r)
      # The on_complete summary now shows the full picture.

- AsyncClient.submit_batch_lenient(name, items): async counterpart of the sync method. Same contract: returns (batch, blocked), does not emit the partial-success RuntimeWarning, and exposes batch.invalid_items / batch.duplicate_urls for the other two rejection buckets. Closes the asymmetry where the production-first async client lacked the opt-in handler for partial-success.

      async with AsyncClient(token) as client:
          batch, blocked = await client.submit_batch_lenient("daily", items)
          for b in blocked:
              log.warning("blocked %s (%s)", b.url, b.reason)
          async for r in batch.iter_results():
              handle(r)
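The two BatchSummary invariants stated in the Added list above can be checked against the sample report. `SummarySketch` is a hypothetical frozen dataclass that borrows the field names from that report; it is not the SDK's BatchSummary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummarySketch:
    submitted: int
    queued: int
    succeeded: int
    failed: int
    blocked: int
    invalid: int
    duplicates: int

    @property
    def rejected(self) -> int:
        # Everything the server refused at submit time.
        return self.blocked + self.invalid + self.duplicates

    def invariants_hold(self, finished: bool) -> bool:
        accounted = self.submitted == self.queued + self.rejected
        if not finished:
            return accounted
        return accounted and self.succeeded + self.failed == self.queued


sample = SummarySketch(submitted=1752, queued=1749, succeeded=1701, failed=48,
                       blocked=0, invalid=3, duplicates=0)
ok = sample.invariants_hold(finished=True)

# A report that silently lost the 3 rejected items would not balance.
lossy = SummarySketch(submitted=1752, queued=1749, succeeded=1701, failed=48,
                      blocked=0, invalid=0, duplicates=0)
lossy_ok = lossy.invariants_hold(finished=True)
print(sample.rejected, ok, lossy_ok)
```

The second case is the pre-0.7.2 failure shape: the submitted count no longer adds up once the invalid bucket is dropped.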
Changed
- submit_batch (strict) emits a RuntimeWarning when the server bucketed any items into blocked_urls, invalid_items, or duplicate_urls. The warning summarises the counts plus the first invalid item's field / error_type so the actionable detail is in the log without forcing the caller to inspect the batch. Visible by default on CPython; suppress with warnings.filterwarnings("ignore", category=RuntimeWarning, module="scrapingpros.batch"). The warning is opt-out by design: silent data loss is the failure mode this release closes.

  Same warning is emitted by AsyncClient.submit_batch.
- Batch.total (and AsyncBatch.total) now seeds correctly at len(payload) − len(blocked) − len(invalid) − duplicates_skipped instead of len(payload). This affects the value of batch.pct, batch.processing_count, and batch.eta_seconds before the first server poll; once polling starts, the server-reported total_requests takes over (unchanged behaviour). Old code that read batch.total immediately after submit would have seen the wrong value when items were rejected; now it sees the queued count.
- submit_batch_lenient does not emit the warning (lenient mode is the explicit opt-in for handling rejections). Its return signature is unchanged, (batch, blocked), to avoid breaking existing callers. The other two buckets are accessible as batch.invalid_items and batch.duplicate_urls on the returned handle.
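Because the signal is a standard RuntimeWarning, the stdlib warnings module can escalate it in tests or record it in production. The sketch below uses a hypothetical `submit_sketch` stand-in that only models the custom_id length rule; it also shows the queued count seeding at the payload length minus rejections.

```python
import warnings


def submit_sketch(items: list, max_custom_id: int = 255) -> int:
    """Hypothetical strict submit: warns on partial success, returns queued count."""
    invalid = [i for i, item in enumerate(items)
               if len(item.get("custom_id", "")) > max_custom_id]
    if invalid:
        warnings.warn(
            f"{len(invalid)} item(s) rejected at submit (first index: {invalid[0]})",
            RuntimeWarning,
            stacklevel=2,
        )
    return len(items) - len(invalid)


items = [
    {"url": "https://example.com/a", "custom_id": "ok"},
    {"url": "https://example.com/b", "custom_id": "x" * 300},
]

# In tests: escalate the warning into an exception so it cannot be missed.
with warnings.catch_warnings():
    warnings.simplefilter("error", RuntimeWarning)
    try:
        submit_sketch(items)
        escalated = False
    except RuntimeWarning:
        escalated = True

# In production: record the warning and keep going.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    queued = submit_sketch(items)

print(escalated, queued, len(caught))
```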
Notes / migration
- A 100%-valid batch behaves identically to v0.7.1: all three buckets are empty lists, no warning is emitted, batch.total == len(payload). The change is invisible to callers that don't hit the partial-success path.
- Server-side context: the API extended its partial-success contract from SSRF rejections (blocked_urls, since v0.5.3) to Pydantic body-validation and parameter rules. The "fail the whole batch on first invalid item" behaviour is gone.
- Why a warning instead of an exception: raising would break clients whose inputs occasionally drift (a CSV with a stray long string in custom_id). The warning lets the caller see the problem during testing and decide whether to handle it via submit_batch_lenient or input cleanup, without forcing every caller to wrap submits in a try/except.
0.7.1 — 2026-05-18
Catches a pool-exhaustion failure mode that survives v0.7.0 when one AsyncClient is shared across a long prep phase and a parallel submit phase. v0.7.0's PoolExhausted correctly names client-pool saturation (vs. a server timeout), but it doesn't fix the underlying cause when the same client is reused across phases.
The failure shape
The script uses one shared AsyncClient for two consecutive phases:
- A "prep" phase: thousands of parallel GETs to fetch per-domain configs.
- A "submit" phase: dozens of parallel submit_batch calls right after.
By the time the submit phase starts, the shared httpx pool is full of keepalive / draining connections from the prep phase. Submit requests queue waiting for a free slot, eventually time out — even though the server is healthy.
Reproduction: any script that does ~20+ parallel GETs followed by ~10 parallel submits on one AsyncClient hits this. The fix is per-worker pool isolation.
Added
- AsyncClient.submit_batches_concurrent(batches, *, concurrency=10): submit many batches in parallel with per-worker pool isolation. Each worker spawns its own fresh AsyncClient inside an async with, so the submit pool is never starved by stale connections from a prior prep phase on the parent client. Returns a list[AsyncBatch] in input order; each handle is reattached to the parent client so iter_results works after the worker clients close.

      async with AsyncClient(token) as client:
          configs = await client.batch_scrape(catalog_urls)  # prep
          batches = await client.submit_batches_concurrent(  # submit
              [(f"daily-{i}", chunk) for i, chunk in enumerate(chunks)],
              concurrency=15,
          )
          for batch in batches:
              async for r in batch.iter_results():
                  save(r)

  Measured on a reproduction with 25 batches × 500 items each (12,408 items total): submitted in 20 seconds with zero errors on the same script that previously failed with a shared client.
- SyncClient.submit_batches_concurrent: same surface, backed by a ThreadPoolExecutor with one fresh SyncClient per thread. For most production code prefer the async variant; this exists so sync users don't get left behind.
- BatchSpec input shape: submit_batches_concurrent accepts either (name, items) tuples or {"name": ..., "items": [...], "callback_url": ...} dicts. items is the same shape as submit_batch (list of URL strings, dicts, or ScrapeRequest).
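The isolation idea behind submit_batches_concurrent can be sketched with plain asyncio. Everything here is a stand-in: `FakeClient` is a hypothetical class that pretends to own a connection pool, and for simplicity the sketch opens one fresh client per submit rather than per worker.

```python
import asyncio


class FakeClient:
    """Hypothetical stand-in for a client that owns its own connection pool."""
    opened = 0

    async def __aenter__(self):
        FakeClient.opened += 1
        return self

    async def __aexit__(self, *exc_info):
        return False

    async def submit_batch(self, name: str, items: list):
        await asyncio.sleep(0.01)  # pretend network round-trip
        return (name, len(items))


async def submit_concurrently(specs: list, concurrency: int = 10) -> list:
    gate = asyncio.Semaphore(concurrency)  # bounds the parallel submits

    async def worker(name: str, items: list):
        async with gate:
            # A fresh client, so no stale connections from earlier phases.
            async with FakeClient() as client:
                return await client.submit_batch(name, items)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(worker(n, i) for n, i in specs))


specs = [(f"daily-{i}", ["https://example.com"] * (i + 1)) for i in range(5)]
handles = asyncio.run(submit_concurrently(specs, concurrency=2))
print(handles, FakeClient.opened)
```

The semaphore caps how many submits run at once, while the per-submit client keeps each pool small and short-lived.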
Changed
- PoolExhausted message now includes pool stats when available, plus the actionable mitigations inline. Example:

      PoolExhausted: SDK connection pool exhausted (request to /v1/async/collections
      never reached the server). Pool state: 100/100 in use, 87 idle / keepalive,
      max_keepalive=100.
      Likely cause: a long prep phase on this client (many earlier GETs/scrapes) is
      holding connections, starving the submit phase.
      Mitigation: (1) for parallel batch submits, use
      client.submit_batches_concurrent(batches, concurrency=N) — each worker uses
      its own fresh client. (2) Or open a fresh AsyncClient just before the submit
      phase. (3) Or raise the pool: AsyncClient(token, http_limits=httpx.Limits(
      max_connections=500, max_keepalive_connections=200)).

  The pool-stats lookup is wrapped in try/except, so a future httpx internal rename can't break the error path: it falls back to configured limits only.
Notes / context
- This is a strict superset of v0.7.0: no behaviour change for callers who don't use submit_batches_concurrent. The new method is purely additive.
- The new docs section "When NOT to share an AsyncClient" (Batch API page) covers the fetch-then-submit anti-pattern with the canonical fix inline.
- All HTTP and DNS in scrapingpros/ goes through httpx's async resolver, with no socket.* / urllib.* blocking calls anywhere in the SDK.
0.7.0 — 2026-05-15
Production-first SDK. We removed the path that doesn't scale and surfaced the failure mode that mimicked a server error. If you used scrape_many, your migration is one word: batch_scrape. Same signature, same return shape, server-side scaling, automatic refunds on soft-blocks, resume after a crash.
Why this release
Three things crystallised at once:
- scrape_many doesn't scale. Benchmarked against the Collections-backed batch_scrape at production-relevant scale (N=1000 URLs, browser=True) on prod:

  | Method | Wall | Throughput |
  |---|---|---|
  | scrape_many | 930 s | 1.07 URLs/s |
  | batch_scrape | 185 s | 5.39 URLs/s |
  | submit_batch streaming | 215 s | 4.66 URLs/s |

  batch_scrape is 5× faster at this size. scrape_many opens N parallel connections from your machine to the per-request endpoint; under load your local pool saturates and the per-request endpoint isn't built for sustained fan-out. Collections-backed methods send a single request to the queued endpoint and stream results back as they finish.

  Beyond the wall-time difference, the Collections-backed methods are also the path that gets you automatic credit refunds on soft-blocks (the validator detects thin / blocked content server-side and refunds without you having to inspect each response). On scrape_many, soft-blocked content arrives as a 200 with mojibake or empty body and you're billed for it.
- Pool-exhaustion errors used to point at the wrong layer. When a client legitimately ran asyncio.gather of many submit_batch calls and hit the SDK's local connection pool ceiling, the failure surfaced as SubmitTimeout pointing at the API endpoint, so users assumed the server was down or slow. The new PoolExhausted exception names the actual cause and tells you to raise the pool ceiling.
- SyncClient inside a running event loop is almost always a misuse. It blocks the loop on every call. We now emit a RuntimeWarning when detected so developers see the issue during testing instead of debugging slow async apps in production.
Breaking
- scrape_many removed from both SyncClient and AsyncClient. The method is still defined (so callers don't see AttributeError) but raises RuntimeError with the migration recipe inline. The replacements:

      # Before
      results = client.scrape_many(urls, format="markdown", browser=True)

      # After (drop-in, returns the same shape)
      results = client.batch_scrape(urls, format="markdown", browser=True)

      # Or, streaming with live progress:
      for result in client.submit_batch("daily", urls).iter_results():
          ...

  Users on v0.5.1+ have had a DeprecationWarning on scrape_many calls for three releases. v0.7.0 makes the failure mode loud.
Added
- PoolExhausted exception. Raised when the SDK's local HTTP connection pool saturates, i.e. the request never left your machine. Distinct from SubmitTimeout (which means the server failed to respond in time). Subclasses ConnectionError so existing except ConnectionError clauses still catch it. The error message points to the fix:

      import asyncio

      import httpx
      from scrapingpros import AsyncClient, PoolExhausted

      try:
          await asyncio.gather(*(client.submit_batch(...) for ...))
      except PoolExhausted:
          # Rebuild the client with a larger pool ceiling, then retry.
          client = AsyncClient(token, http_limits=httpx.Limits(
              max_connections=500, max_keepalive_connections=200,
          ))
          # ... retry

- http_limits= constructor argument on SyncClient and AsyncClient. Pass any httpx.Limits to override the default pool ceiling (200 / 100). No more monkey-patching client._http.
- RuntimeWarning when SyncClient is instantiated inside a running event loop. Catches the common misuse from AI-generated code and async refactors. Doesn't break anything: it's a warning, not an error.
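Detecting "instantiated inside a running event loop" needs only the stdlib. The sketch below shows one way such a check could work; `BlockingClientSketch` is a hypothetical class, and nothing here claims to be how SyncClient implements it.

```python
import asyncio
import warnings


def in_running_loop() -> bool:
    """True when called from code that an event loop is currently running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


class BlockingClientSketch:
    """Hypothetical stand-in for a blocking (sync) client."""

    def __init__(self):
        if in_running_loop():
            warnings.warn(
                "blocking client created inside a running event loop",
                RuntimeWarning,
                stacklevel=2,
            )


async def misuse() -> int:
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        BlockingClientSketch()  # constructed from a coroutine
    return len(caught)


outside_loop = in_running_loop()          # plain script context
warnings_inside = asyncio.run(misuse())   # coroutine context
print(outside_loop, warnings_inside)
```

asyncio.get_running_loop() raises RuntimeError when no loop is running, which is what makes the check cheap and side-effect free.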
Changed
- download() deprecated. Both SyncClient.download(url) and AsyncClient.download(url) emit a DeprecationWarning at runtime. Since v0.6.0, scrape(url, browser=False) returns binary content natively (body_base64, body, save()) and covers the same use case with a richer surface. download() will be removed in a future major release.

      # Before
      import base64

      result = client.download(url)
      data = base64.b64decode(result.content)

      # After (recommended)
      resp = client.scrape(url, browser=False)
      resp.save("file.pdf")  # or: resp.body for bytes

- Class-level docstrings rewritten on SyncClient and AsyncClient to make the canonical pattern obvious: SyncClient for REPL / one-off scripts, AsyncClient for production work. Both classes call every endpoint; the "sync" / "async" in the API URL refers to response delivery (inline vs queued), not the Python client.
Notes / migration
- What if I really need to fan out from the client? You almost never do. The Collections API is faster and more reliable above ~200 URLs. Below that, submit_batch adds a small fixed overhead: at N=100 with browser=True, scrape_many was ~3× faster in absolute terms (42 s vs 135 s), but you give up refunds, resume, idempotency, and soft-block detection. For new code, prefer the Collections methods even at small N; you'll never have to revisit the choice when N grows.
- Benchmark against your own workload to validate sizing decisions: pick N, browser=True/False, and target sites that match your real traffic. The 5× ratio above is for N=1000 with browser=True on a mixed real-site sample; small-N or non-browser numbers will look different.
- scrape_many callers will see a RuntimeError on the next call after upgrade. The error message contains the migration recipe and a link to these notes; we deliberately did not leave a quiet AttributeError so the failure is actionable.
- No changes to the wire format. The API endpoints, request/response shapes, and existing client methods are unchanged. This is an SDK-only release.
0.6.0 — 2026-05-14
Binary content support on ScrapeResponse. The API now returns PDFs / images / ZIPs / etc. directly from /scrape with three new fields (content_type, body_base64, body_url); previously the SDK silently discarded them and clients downloading files via /scrape got an empty html. v0.6.0 exposes the fields and adds five helpers modelled on requests.Response so downloading a PDF is a one-liner.
This is a minor bump (0.5.3 → 0.6.0) — zero breaking changes, all additions opt-in.
Added
- ScrapeResponse.content_type: str | None. Standard HTTP Content-Type of the response (e.g. "text/html; charset=utf-8", "application/pdf", "image/png"). Populated by the API since 2026-05-14; None for legacy responses (treat absence as "text/html" for backward compatibility).
- ScrapeResponse.body_base64: str | None. Base64-encoded raw response body, populated only when the response is binary. Use the body / text / save helpers below to access the content; you almost never need to decode this field yourself.
- ScrapeResponse.body_url: str | None. Reserved for future blob offload of large binary bodies (>5 MB threshold). Currently always None; the field is defined now for forward compatibility so clients using download_body() don't break when the server starts populating it.
- is_binary property: True iff the response is binary. Use this to branch before accessing html (empty for binary) or content (returns None for binary):
    resp = client.scrape("https://investors.example.com/charter.pdf")
    if resp.is_binary:
        resp.save("charter.pdf")
    else:
        print(resp.content)
- body property: bytes. Mirrors requests.Response.content. Always returns bytes: decodes body_base64 for binary, UTF-8-encodes markdown or html for text. Raises ValueError for offloaded bodies (body_url), pointing you to download_body().
- text property: str | None. Mirrors requests.Response.text. Returns None for binary so a careless str.find() call fails loudly instead of silently parsing base64.
- save(path): write the body to a file, return bytes written. Works for both text (UTF-8 encoded) and binary:
    client.scrape(pdf_url).save("out.pdf")
- download_body() (async) + download_body_sync() (sync): fetch body bytes whether they're inline (body_base64) or offloaded (body_url). The sync variant exists so SyncClient users don't need an event loop when the blob-offload path eventually lights up.
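The helper semantics in the list above can be pictured with a small stand-in class. This is only a sketch under assumptions: ResponseSketch is a hypothetical name, it is not the SDK's ScrapeResponse, and the is_binary rule used here (body_base64 is populated) is a guess at the simplest possible check.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResponseSketch:
    """Hypothetical stand-in that mimics the fields named in these notes."""

    html: str = ""
    content_type: Optional[str] = None
    body_base64: Optional[str] = None

    @property
    def is_binary(self) -> bool:
        # Assumption: a populated body_base64 marks a binary response.
        return self.body_base64 is not None

    @property
    def body(self) -> bytes:
        # Bytes in both cases: decode base64 for binary, UTF-8-encode text.
        if self.body_base64 is not None:
            return base64.b64decode(self.body_base64)
        return self.html.encode("utf-8")

    @property
    def text(self) -> Optional[str]:
        # None for binary, so string operations on it fail loudly.
        return None if self.is_binary else self.html


pdf = ResponseSketch(
    content_type="application/pdf",
    body_base64=base64.b64encode(b"%PDF-1.7 demo").decode("ascii"),
)
page = ResponseSketch(html="<h1>hi</h1>")  # legacy shape: content_type is None
```

With this shape, calling pdf.text.find("x") raises an AttributeError on None instead of quietly searching base64 text, which is the failure mode the notes describe as intentional.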
Changed
content property now returns None for binary responses (was returning the empty html string). This is a behaviour change for binary responses only; text responses are unchanged. The previous return value (empty string for binary) was effectively useless, so this is a clarity improvement, not a meaningful break.
Notes
- Backward compat: every existing field is preserved, every existing accessor (html, markdown, guidance, statusCode, content for text) behaves identically. Code that didn't touch binary content sees no change.
- MethodPOST.content_type vs ScrapeResponse.content_type: both fields are now named content_type. They are semantically distinct (request body encoding "json" / "form", vs. HTTP response Content-Type) and live on different models. Autocomplete will surface both; pick by context.
- body_url is reserved: the SDK ships download_body() / download_body_sync() now so client code is forward-compatible. When the server starts offloading bodies, no client change is needed.
0.5.3 — 2026-05-04
Consumes eight new server endpoints/contracts. The recovery flow that was partial in v0.5.2 is now end-to-end: a SubmitTimeout followed by a retry no longer duplicates the batch, and find_recent_batch reattaches to the live run in a single round-trip.
Added
- Idempotency-Key is now sent automatically on submit_batch and submit_batch_lenient. The SDK generates a fresh UUID per call so a network-level retry of the same submit (httpx.ReadTimeout, transient 5xx) returns the same collection_id without creating a duplicate run. Pass idempotency_key=... to control the key yourself; pass it through create_collection if you're using the lower-level method directly.
  Practical impact: after a SubmitTimeout, it is safe to retry. The server replays the original response within 24 h.
- client.list_runs(cid): lists every run of a collection. Available on both SyncClient and AsyncClient. Backed by GET /v1/async/collections/{cid}/runs (server-side endpoint added 2026-04-30). Optional status_filter="in_progress" or "completed".
    resp = client.list_runs(cid, status_filter="in_progress")
    for run in resp.items:
        print(run.run_id, run.status, run.created_at)
- RunListPublic model: the response shape returned by list_runs. Exported from the top-level package.
- Typed 404s on get_job_result: JobResultPending, JobResultExpired, JobResultLost, and JobNotFound, all inheriting from a new JobResultError (itself inheriting from APIError). The SDK parses the structured error_code the API now returns and raises the appropriate subclass:
    from scrapingpros import JobResultPending, JobResultExpired, JobNotFound
    try:
        r = client.get_job_result(cid, rid, jid)
    except JobResultPending:
        schedule_retry(jid)
    except JobResultExpired:
        requeue(jid)  # > 24 h since completion
    except JobNotFound:
        log_bug(jid)
  Existing except APIError clauses continue to catch them.
- CollectionPublic.created_at + updated_at: both float | None (Unix epoch seconds, UTC). Lets find_recent_batch(since=...) filter precisely server-side. None on older rows that pre-date the field.
- RunPublic.created_at: same shape, available on every RunPublic returned by the API.
- BlockedURL model: describes a URL the API refused to enqueue when creating a collection. Exposed on NewCollectionResponse.blocked_urls. Categorised by reason (private_ip, invalid_protocol, dns_failed, blocked_hostname, invalid_port, malformed_url, blocked).
Changed
- find_recent_batch(name, since=...) now uses the server-side ?name= and ?since= filters in a single round-trip (was scanning every collection client-side in v0.5.2). It also reattaches to the live run of the recovered collection by calling list_runs(cid, status_filter="in_progress"), so the returned Batch is fully usable and no longer a partial handle with run_id="".
- submit_batch_lenient rewritten around the new server contract. Reads NewCollectionResponse.blocked_urls directly instead of parsing HTTP 400 detail strings and retrying one URL at a time.
  Signature change: returns tuple[Batch, list[BlockedURL]] (was tuple[Batch, list[dict]] in v0.5.2). The max_drops parameter is gone because there is no longer a retry loop. Add idempotency_key=... if you want explicit control.
- get_job_result 404 path: was raising APIError(404, ...), now raises a JobResultError subclass (which still passes isinstance checks against APIError).
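Code that has to run against both v0.5.2 and v0.5.3 during a staged upgrade could normalise the two shapes of a dropped entry with a small shim. This is a sketch, not SDK code: dropped_fields is a hypothetical helper, SimpleNamespace stands in for BlockedURL, and the example values are illustrative.

```python
from types import SimpleNamespace
from typing import Any, Tuple


def dropped_fields(item: Any) -> Tuple[str, str]:
    """Return (url, reason) for either the dict shape or the object shape."""
    if isinstance(item, dict):
        # v0.5.2 shape: plain dict with a "__rejection_reason__" key.
        return item["url"], item["__rejection_reason__"]
    # v0.5.3 shape: attribute access, as on BlockedURL.
    return item.url, item.reason


# Illustrative values only; the real reason strings come from the API.
old_style = {"url": "http://10.0.0.1/", "__rejection_reason__": "private_ip"}
new_style = SimpleNamespace(url="ftp://example.com/", reason="invalid_protocol")

old_pair = dropped_fields(old_style)
new_pair = dropped_fields(new_style)
```

Once every caller is on v0.5.3 the shim can be deleted and the attribute access used directly.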
Notes / migration
- submit_batch_lenient keeps its behaviour but changes the type of the second tuple element, which is a breaking change for code that indexes into it: the type went from list[dict[str, Any]] to list[BlockedURL]. Replace dropped["url"] / dropped["__rejection_reason__"] accesses with dropped.url / dropped.reason.
- Idempotency-Key is on by default. Pass idempotency_key=... to control the key yourself (e.g. derive it from your DB row id for a reproducible retry).
- Older collections / runs may return null for created_at and updated_at. The SDK fields are float | None to tolerate that.
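One way to derive a reproducible key from a database row id, as the notes above suggest, is a name-based UUID. The sketch below uses only the standard library; APP_NAMESPACE, the "submit-batch:" prefix, and idempotency_key_for are hypothetical choices, not part of the SDK.

```python
import uuid

# Hypothetical namespace for this application; any fixed UUID works.
APP_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "batches.example.com")


def idempotency_key_for(row_id: int) -> str:
    """Derive the same key every time for the same database row."""
    return str(uuid.uuid5(APP_NAMESPACE, f"submit-batch:{row_id}"))


key_first_try = idempotency_key_for(42)
key_retry = idempotency_key_for(42)
key_other_row = idempotency_key_for(43)
```

The resulting string would then be passed as idempotency_key=... on the submit call, so a retry after a crash or restart reuses the key of the original attempt.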
0.5.2 — 2026-04-30
Resilience and recovery release. Addresses fifteen concerns reported by integrators, grouped around three goals: batches surviving transient API degradation, recovering from submit_batch timeouts without creating duplicate runs, and giving users the right tools the first time.
Fixed
- Batch.iter_results() no longer dies on transient HTTP 500 / 502 / 503 / 504. Previously the polling loop only caught ConnectionError; any 5xx response raised APIError and crashed the iterator while the batch kept running server-side. Now polling distinguishes transient errors (5xx, 429, network drops) from real semantic failures (other 4xx) and rides out API hiccups by retrying on the next tick. The high-water mark is preserved so no progress is lost. Same fix on AsyncBatch.iter_results().
- client.get_batch(cid, rid) now refreshes counters on construction by default, so batch.total, batch.success_count, batch.pct, etc. are populated immediately instead of staying at 0 until the first iteration tick. Pass refresh=False to skip the round-trip.
- submit_batch() validates URLs in dict items client-side. Previously a {"url": None} or {"url": ""} would create a collection that fails downstream in a worker with an unhelpful 'NoneType' object has no attribute 'lower'. Now you get a ValueError pointing at the input index before the request is sent.
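The transient-versus-semantic split described above can be sketched as a small polling loop. Everything here is illustrative: PollError, is_transient, poll_until_done, and flaky_fetch are hypothetical names, the SDK's internals may differ, and a real loop would sleep between ticks.

```python
from typing import Callable, List, Optional, Tuple


class PollError(Exception):
    """Hypothetical error carrying the HTTP status (None for a network drop)."""

    def __init__(self, status_code: Optional[int]):
        super().__init__(f"poll failed: {status_code}")
        self.status_code = status_code


def is_transient(status_code: Optional[int]) -> bool:
    """True for failures worth retrying: network drops, 429, and 5xx."""
    if status_code is None:
        return True
    return status_code == 429 or 500 <= status_code <= 599


def poll_until_done(
    fetch_page: Callable[[int], Tuple[List[str], bool]], max_ticks: int = 10
) -> List[str]:
    """Collect results tick by tick, skipping ticks that fail transiently."""
    seen = 0  # high-water mark: number of results already delivered
    out: List[str] = []
    for _ in range(max_ticks):
        try:
            page, done = fetch_page(seen)
        except PollError as exc:
            if is_transient(exc.status_code):
                continue  # ride it out; the high-water mark is untouched
            raise  # semantic failure: surface it to the caller
        out.extend(page)
        seen += len(page)
        if done:
            break
    return out


# Fake backend: the second poll fails with a 503, the rest succeed.
DATA = ["a", "b", "c"]
calls = {"n": 0}


def flaky_fetch(offset: int) -> Tuple[List[str], bool]:
    calls["n"] += 1
    if calls["n"] == 2:
        raise PollError(503)
    page = DATA[offset:offset + 2]
    return page, offset + len(page) >= len(DATA)


results = poll_until_done(flaky_fetch)
```

Because the offset only advances after a successful tick, the failed poll costs one tick and no result is skipped or delivered twice.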
Added
- SubmitTimeout exception (subclass of TimeoutError) raised when submit_batch() cannot reach the API in time. Distinct from polling timeouts: the batch may or may not have been created server-side, so the message tells you what to do (search for orphans before retrying). Exported from the top-level package.
    from scrapingpros import SyncClient, SubmitTimeout
    try:
        batch = client.submit_batch(name, items)
    except SubmitTimeout:
        orphan = client.find_recent_batch(name=name)
        ...
- submit_batch(submit_timeout=30.0): dedicated short timeout for the submit round-trip, separate from the 120 s default that covers individual scrape requests. Fail fast during API degradation instead of hanging for two minutes.
- submit_batch(on_submitted=fn) callback: fires inside the SDK call right after the collection (and again after the run) is created server-side, so you can persist (collection_id, run_id) to disk before any code that might crash. fn(collection_id, run_id_or_none) is invoked twice; on the async client it can be a coroutine.
    import json
    from pathlib import Path
    def remember(cid, rid):
        Path("ids.json").write_text(json.dumps({"cid": cid, "rid": rid}))
    batch = client.submit_batch(name, items, on_submitted=remember)
- client.find_recent_batch(name, since=None): orphan recovery helper. After a SubmitTimeout, looks up a recently created collection by exact name match and returns a Batch handle pointing at it. Use a unique name per submit (e.g. with a UUID suffix) for this to be reliable. Filtering by since is accepted for forward compatibility once the server starts returning created_at on collections.
- client.iter_results(cid, rid) convenience shortcut: equivalent to client.get_batch(cid, rid, refresh=False).iter_results(...). Lets you stream results from a persisted (cid, rid) pair without dealing with the Batch handle. Available on both SyncClient and AsyncClient.
- Batch.refresh(): public method to force a one-shot refresh of progress counters. Useful for monitoring loops, dashboards, or recovery scripts that want a snapshot without entering an iter_results() loop. Returns self so you can chain. Same on AsyncBatch.refresh().
- client.submit_batch_lenient(name, items): variant of submit_batch that drops URLs the API rejects (private IPs from DNS resolution, takedown redirects, etc.) and retries until the batch is accepted. Returns (batch, dropped). Useful when working with a pool of URLs that occasionally has flaky DNS. Sync client only for now.
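Since the on_submitted callback is invoked twice (first without a run id, then with one), a callback that overwrites its file atomically avoids leaving a half-written record if the process dies between the two calls. The following is a minimal sketch under assumptions: the file location and the simulated ids are made up, and the two direct calls stand in for what the SDK would do.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical location; a real script would pick a durable directory.
IDS_PATH = Path(tempfile.mkdtemp()) / "ids.json"


def remember(collection_id: str, run_id: Optional[str]) -> None:
    """Persist the ids atomically; safe to call with run_id=None and again later."""
    payload = json.dumps({"cid": collection_id, "rid": run_id})
    tmp_path = IDS_PATH.with_suffix(".tmp")
    tmp_path.write_text(payload)
    os.replace(tmp_path, IDS_PATH)  # atomic rename on the same filesystem


# Simulate the two invocations described above.
remember("col_123", None)
after_first = json.loads(IDS_PATH.read_text())
remember("col_123", "run_456")
after_second = json.loads(IDS_PATH.read_text())
```

A recovery script can then read the file and treat a null rid as "collection exists, run not yet confirmed".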
Changed
- Retry log demoted from WARNING to INFO. Under load, the SDK can fire dozens of retry log lines per minute (Request POST /v1/sync/scrape returned 500, retrying ...). Previously these were emitted at WARNING, drowning legitimate WARNINGs in the caller's output. They are now at INFO, so the default WARNING level silences them; lower the SDK logger to INFO when you need visibility:
    import logging
    logging.getLogger("scrapingpros").setLevel(logging.INFO)
  No SDK code uses print() directly, and the SDK never calls logging.basicConfig(), so your logger configuration is fully respected.
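The effect of the level change can be checked with the standard logging module alone. In this sketch the log line is emitted by hand to imitate the SDK, and ListHandler is a hypothetical in-memory handler used only to make the result inspectable.

```python
import logging


class ListHandler(logging.Handler):
    """Collect messages in memory so the effect is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


RETRY_LINE = "Request POST /v1/sync/scrape returned 500, retrying ..."

sdk_logger = logging.getLogger("scrapingpros")
handler = ListHandler()
sdk_logger.addHandler(handler)

# With the logger at WARNING, an INFO-level retry line is dropped.
sdk_logger.setLevel(logging.WARNING)
sdk_logger.info(RETRY_LINE)
hidden = list(handler.messages)

# Lowering the logger to INFO makes the same line visible.
sdk_logger.setLevel(logging.INFO)
sdk_logger.info(RETRY_LINE)
visible = list(handler.messages)
```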
Notes
- Idempotency keys for submit_batch are not yet available server-side. Until they ship, find_recent_batch + a unique name per submit (UUID suffix) is the recommended pattern to avoid duplicates after a SubmitTimeout.
- A few of the underlying capabilities (a runs listing endpoint, created_at on collections, structured status_filter on /jobs) need server-side work before the SDK can fully close some recovery paths. SDK-side workarounds are in place where possible; the missing pieces are tracked for follow-up.
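A unique name per submit can be as simple as a prefix plus a random UUID. The helper name and the prefix below are hypothetical; the point is only that an exact-name lookup after a timeout should match one batch.

```python
import uuid


def unique_batch_name(prefix: str) -> str:
    """Append a random UUID suffix so an exact-name lookup matches one batch only."""
    return f"{prefix}-{uuid.uuid4().hex}"


name_a = unique_batch_name("catalog-refresh")
name_b = unique_batch_name("catalog-refresh")
```

The generated name needs to be stored (or logged) before the submit call, otherwise there is nothing to search for after a SubmitTimeout.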
0.5.1 — 2026-04-29
Naming and discoverability release. No new functionality on the wire — this version makes the recommended way to scale to many URLs much easier to find, and pushes back on patterns that don't scale.
Why this release exists
Three different mechanisms exist for scraping multiple pages, and the names made it easy to pick the wrong one:
- scrape(): one URL at a time, blocking.
- scrape_many(): opens N concurrent HTTP connections from your machine to /v1/sync/scrape. Doesn't scale.
- submit_batch() / Collections API: sends one request, the server runs the batch with optimised concurrency. Scales to 50,000+ URLs.
Users repeatedly reached for scrape_many() (or thought switching to AsyncScrapingPros would magically improve throughput) when they actually wanted the Collections API. This release marks the misleading paths as deprecated and adds the tools that make the right path obvious.
(The original v0.5.1 release notes pointed at a "Choosing the right method" page that was folded into the Batch API doc's FAQ in v0.7.0.)
Added
- SyncClient and AsyncClient: preferred names for the existing client classes. Same surface, same behaviour. The new names clarify that the only real difference between the two classes is the local I/O loop, not which API endpoints they can reach (both can call sync and Collections).
    # Old
    from scrapingpros import ScrapingPros, AsyncScrapingPros
    # New (recommended)
    from scrapingpros import SyncClient, AsyncClient
- client.batch_scrape(urls, ...): convenience wrapper around submit_batch + iter_results that blocks until the batch completes and returns a flat list[ScrapeResponse]. Drop-in replacement for scrape_many when you want server-side scaling without writing the streaming loop.
    # Same shape as scrape_many, but uses the Collections API under the hood
    results = client.batch_scrape([
        {"url": u, "custom_id": product_id, "browser": True}
        for product_id, u in catalog.items()
    ])
  Available on both SyncClient and AsyncClient.
Changed
run_and_wait() default timeout is now 3600 seconds (1 hour), up from 300. The old default was timing out legitimate browser/stealth runs that take 20+ minutes. If you relied on the 300 s default to cap runaway runs, pass timeout=300 explicitly. (Applies to both SyncClient.run_and_wait and AsyncClient.run_and_wait.)
Deprecated
All deprecations emit a DeprecationWarning at runtime (one-shot, location-deduped — Python's standard). Behaviour is unchanged; replacements work today. Removal targeted for v1.0.
- ScrapingPros → use SyncClient instead. Identical class behaviour; the rename clears up the false implication that it only spoke to the sync endpoint.
- AsyncScrapingPros → use AsyncClient instead.
- scrape_many() → use batch_scrape() (list return) or submit_batch() + iter_results() (streaming). Server-side parallelism instead of N parallel HTTP connections from your machine.
    # This still works in v0.5.x but emits a DeprecationWarning
    results = client.scrape_many(urls)
    # Prefer one of these instead
    results = client.batch_scrape(urls)  # blocking, returns list
    for r in client.submit_batch("name", urls).iter_results():
        ...  # streaming, with progress
Notes
- DeprecationWarning is ignored by Python's default warning filters unless it is triggered directly by code in __main__, so most production code will not see it. Tests run with strict warning filters (e.g. filterwarnings = error in pytest) will treat it as an error; either migrate to the new names or filter the specific warning.
- The ScrapingPros and AsyncScrapingPros symbols still resolve normally, are still in __all__, and continue to work identically. Existing code does not need to change immediately.
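Filtering one specific warning while keeping strict mode for everything else can be done with the standard warnings module. In this sketch old_entry_point is a hypothetical stand-in for a deprecated SDK name, and the message text is illustrative; the real wording emitted by the SDK may differ, so the pattern would need adjusting.

```python
import warnings


def old_entry_point() -> str:
    """Hypothetical stand-in for a deprecated SDK name."""
    warnings.warn(
        "ScrapingPros is deprecated; use SyncClient",  # illustrative text only
        DeprecationWarning,
        stacklevel=2,
    )
    return "ok"


# Strict mode, as in a test suite that turns warnings into errors.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        old_entry_point()
        raised = False
    except DeprecationWarning:
        raised = True

# Same strict mode, plus a narrow ignore for this one message.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    warnings.filterwarnings(
        "ignore",
        message=r"ScrapingPros is deprecated",
        category=DeprecationWarning,
    )
    result = old_entry_point()
```

The narrow filter is inserted ahead of the blanket "error" rule, so only the matching message is silenced and any other warning still fails the run.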
0.5.0 — 2026-04-29
Feature release adding form-encoded POST, response body capture, attached-state waits, and richer per-job metadata. No breaking changes; two soft deprecations.
Fixed
MethodPOST no longer silently drops content_type. Passing content_type="form" to MethodPOST now correctly sends the body as application/x-www-form-urlencoded. Previously the parameter was discarded by the SDK and the body always went out as JSON, which broke OAuth2 grant_type=client_credentials flows and any other API that requires form-encoded payloads: the server would respond 400 invalid Content-Type and the request would fail with no obvious cause. If your scraper authenticated against an OAuth2 token endpoint and was getting empty results, upgrading to 0.5.0 fixes it.
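The difference between the two encodings is easy to see with the standard library. This sketch only shows what the two body formats look like for the same payload; it does not reproduce how the SDK or the server builds the request.

```python
import json
from urllib.parse import urlencode

payload = {"grant_type": "client_credentials", "scope": "read"}

# JSON encoding: what went out before the fix, regardless of content_type.
json_body = json.dumps(payload)

# Form encoding: what an OAuth2 token endpoint typically expects.
form_body = urlencode(payload)

print(json_body)  # {"grant_type": "client_credentials", "scope": "read"}
print(form_body)  # grant_type=client_credentials&scope=read
```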
Added
- MethodPOST.content_type: choose "json" (default, unchanged behaviour) or "form". Use "form" for OAuth2 client_credentials flows and most legacy form-based APIs.
    from scrapingpros import MethodPOST
    resp = client.scrape(
        "https://api.example.com/v1/oauth2/token",
        http_method=MethodPOST(
            payload={"grant_type": "client_credentials", "scope": "read"},
            content_type="form",
        ),
    )
- WaitForSelectorAction.state: accepts "visible" (server default), "attached", or "hidden". Pass "attached" to match hidden DOM nodes such as <script id="__NEXT_DATA__"> tags carrying embedded JSON, which the default visible wait never resolves, so it always times out.
    from scrapingpros import WaitForSelectorAction
    result = client.scrape(url, browser=True, actions=[
        WaitForSelectorAction(
            selector="css:script#__NEXT_DATA__",
            time=8000,
            state="attached",
        ),
    ])
- NetworkCaptureConfig.url_pattern: glob pattern that asks the server to capture the response body of matching requests, in addition to the usual metadata. Useful for grabbing OAuth / Firebase tokens, GraphQL persistedQuery payloads, or any internal API response without re-running the request yourself.
  Bodies are capped at 64 KB; larger responses come back with body_truncated: true. Body fetch has a 5 s timeout; if it expires, the entry gets a body_error field instead of body (the scrape itself never hangs).
    from scrapingpros import NetworkCaptureConfig
    result = client.scrape(url, browser=True, network_capture=NetworkCaptureConfig(
        resource_types=["xhr", "fetch"],
        url_pattern="*identitytoolkit.googleapis.com*",
    ))
    for entry in result.network_requests or []:
        if "body" in entry:
            token = parse_token(entry["body"])
- JobExecutionPublic.has_extractable_data (bool | None): whether the page contained structured data the server could extract (JSON-LD, microdata, OpenGraph, __NEXT_DATA__). Independent of is_success: a 200 page with usable text content can still have no machine-parseable payload.
- JobExecutionPublic.validator_version (str | None): version of the HTML Validator that produced is_success, block_reason, protection_stack, and rule_hits for the job. Pin it in integration tests to catch silent classifier upgrades:
    for job in client.iter_run_jobs(col.id, run.run_id):
        assert job.validator_version == "0.1.6"
- JobExecutionPublic.client_id (str | None): the client account that owns the job. Useful when working across multiple tenants.
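The three outcomes described for captured bodies (present, truncated, or replaced by body_error) can be handled with a small classifier. This is a sketch under assumptions: read_captured_body is a hypothetical helper, entries are treated as plain dicts as in the example above, and the "url" key and the sample values are made up for illustration.

```python
from typing import Any, Dict, Tuple


def read_captured_body(entry: Dict[str, Any]) -> Tuple[str, Any]:
    """Return (status, value) for one captured network entry."""
    if "body_error" in entry:
        return "error", entry["body_error"]  # body fetch timed out or failed
    if "body" not in entry:
        return "no_body", None  # metadata only, e.g. URL did not match the pattern
    if entry.get("body_truncated"):
        return "truncated", entry["body"]  # capped; may not parse as a whole document
    return "ok", entry["body"]


# Illustrative entries; real ones come from result.network_requests.
entries = [
    {"url": "https://a.example/token", "body": '{"idToken": "abc"}'},
    {"url": "https://a.example/big", "body": "x" * 10, "body_truncated": True},
    {"url": "https://a.example/slow", "body_error": "timeout"},
    {"url": "https://a.example/other"},
]
statuses = [read_captured_body(e)[0] for e in entries]
```

Separating "truncated" from "ok" matters when the body is JSON, since a body cut off at the cap will usually fail to parse.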
Deprecated
Both deprecations are docstring-only — existing code keeps working with no runtime warning. They're flagged so new code can avoid them and a future major release can remove them cleanly.
- ScrapeRequest.browser_type: the API now picks the right engine per domain via internal routing. New code should choose only between browser=True (5 credits, full rendering) and browser=False (1 credit, fast path) and let the server handle the rest. Existing code that passes "light" / "heavy" / "stealth" still works.
- ScrapeResponse.potentiallyBlockedByCaptcha: prefer response.guidance.success for the canonical verdict. guidance also tells you why a request failed and what to try next (error_type, error_provider, next_steps, suggested_request), which the legacy boolean cannot.
    # Old
    if resp.potentiallyBlockedByCaptcha:
        retry()
    # New
    if not resp.guidance.success:
        print(resp.guidance.error_type, resp.guidance.next_steps)
        retry_with(**resp.guidance.suggested_request)
Removed
Nothing removed in 0.5.0. The deprecations above will be candidates for removal in a future major release.
0.4.3 — 2026-04-24
Added
- JobExecutionPublic.is_success (bool | None): server's authoritative verdict for whether a job produced usable content. Catches soft-blocks (Google CAPTCHA pages with 200 + large body, Amazon "Robot Check") that a naive status_code check misses.
- RunPublic.success_criterion: exposes the active success policy (version, rules) so you can pin it in tests.
Changed
Batch.iter_results() now honours the server is_success verdict internally; result.guidance.success reflects it without extra effort.
0.4.2 — 2026-04-24
Fixed
- Batch polling no longer aborts on transient ConnectionError during worker restarts. The SDK now retries with backoff instead of raising.
- since_completed_at polling correctly resumes after partial failures, avoiding duplicate result delivery.
- Guidance fallback for jobs created before the server-side guidance rollout: the SDK now reconstructs basic guidance client-side so result.guidance.success is always populated.
0.4.1 — 2026-04-23
Added
- Cursor-based pagination for get_run_jobs() and iter_run_jobs() with server-side filters (status_filter, since_completed_at). Scales cleanly to runs with 50,000+ jobs without timing out.
0.4.0 — 2026-04-23
Added
- Batch API: the headline feature for production-scale scraping:
  - client.submit_batch(name, requests): submit any number of URLs at once.
  - batch.iter_results(): stream results as workers finish them, with progress (pct, eta_seconds, success_count, failed_count).
  - Per-job callbacks, automatic resume, configurable timeouts.
  Submit 50,000 URLs, walk away, come back to handled results.
0.3.0 — 2026-04-23
Added
- ScrapeRequest.custom_id: round-trip a string through the API to map results back to your database without depending on order. Echoed in ScrapeResponse.custom_id and JobExecutionPublic.custom_id.
- MethodPOST.url: POST to a different endpoint than the navigation target. Useful for sites that set cookies on one URL but expose data on a separate API/GraphQL endpoint.
0.2.4 — 2026-04-10
Added
scrape_many() extended with all scrape() parameters and a heterogeneous mode (each URL can carry its own per-request configuration).
0.2.3 — 2026-04-10
Added
- browser_type="stealth" mode for hardened anti-bot sites.
- block_resources (image/font/media/etc.) and block_requests (URL substring blocklist) to speed up browser scrapes by stripping unnecessary resources and trackers.
Changed
- Default browser_type is now "light", which gives significantly more concurrent throughput than "heavy" for the vast majority of sites.
0.2.2 — 2026-04-09
Added
- ScrapeGuidance on every response. Tells you why a scrape failed and what to do next, in a structured form (success, error_type, error_provider, next_steps, suggested_request, stop_reason).
- Multi-mode viability testing: try several scraping strategies against a URL in one call to find what works before building a full pipeline.
Older versions
For releases before 0.2.2, see the PyPI release history.